STATS 32 Session 6: Functions and more data transformation

Kenneth Tay

Oct 10, 2019

Recap of session 5

ALL of the dplyr functions we saw (e.g. select(), filter()) take:

  1. A dataset, and
  2. Instructions on what to do with the dataset.

The dataset is either:

  1. The first argument within the function’s parentheses, e.g.
select(df, day)
  2. Passed to the function through a “pipe” %>%, e.g.
df %>% select(day)

ALL of these functions return a dataset!

You can do three things with this returned dataset:

  1. Nothing, in which case it prints to screen.
  2. Save it by assigning it to a variable.
  3. Don’t save it, but pass it on to another function using a “pipe” %>%
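
The three options above can be sketched as follows. This is a minimal illustration using the built-in mtcars dataset; the variable names small and low_mpg are made up for the example.

```r
library(dplyr)

# 1. Do nothing with the result: it prints to the screen
select(mtcars, wt, mpg)

# 2. Save the result by assigning it to a variable (nothing is printed)
small <- select(mtcars, wt, mpg)

# 3. Don't save the intermediate result; pipe it into another function
low_mpg <- mtcars %>%
    select(wt, mpg) %>%
    filter(mpg < 15)
```

Only the final result of the pipeline in option 3 is stored; the two-column dataset produced by select() is passed straight to filter().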

%>% syntax with dplyr

Take the mtcars dataset, select just the wt and mpg columns, then keep only the rows with mpg < 15:

library(dplyr)
mtcars %>%
    select(wt, mpg) %>%
    filter(mpg < 15)

Agenda for today

Functions: R’s workhorse

A function is a named block of code which takes inputs, does something with them, and returns an output.

(Source: codehs.gitbooks.io)

We use functions in R all the time

We’ve already seen a number of functions in R! For example,

is.character("123")
## [1] TRUE

The function is.character takes the input given to it in the parentheses and returns TRUE or FALSE, depending on whether the input is of type character or not.

Others we’ve seen: str(), head(), rm(), ggplot(), select(), …

We can see what a function does by typing in ? followed by the function name in the R console.

?is.character

Function syntax

The most important syntax in R is the function call. All R syntax has function calls underlying it.

A function call consists of:

function_name(<inputs to the function>,
              <arguments which change 
              how the function operates>)
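
One way to see that R syntax has function calls underlying it is to call operators as ordinary functions by wrapping their names in backticks. The sketch below uses only base R; the variable names are arbitrary.

```r
# Arithmetic: 1 + 2 is a call to the function `+`
`+`(1, 2)            # same as 1 + 2, gives 3

# Subsetting: x[2] is a call to the function `[`
x <- c(10, 20, 30)
`[`(x, 2)            # same as x[2], gives 20

# Assignment: y <- 5 is a call to the function `<-`
`<-`(y, 5)
y                    # 5
```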

Function example

Here x is the input to mean(), and na.rm = TRUE is an argument which changes how mean() operates: missing values (NA) are removed before averaging.

x <- c(-5, -3, -1, 1, 3, NA)
mean(x)
## [1] NA
mean(x, na.rm = TRUE)
## [1] -1

Function calls read “inside out”

abs(x): If x is positive, return x. If x is negative, return x without the negative sign.

mean(abs(x), na.rm = TRUE)
## [1] 2.6

R evaluates the innermost call first: abs(x) is computed, then mean() is applied to its result.

The pipe operator %>%

library(magrittr)
x %>% abs() %>% mean(na.rm = TRUE)
## [1] 2.6
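
The pipe passes the value on its left as the first argument of the function on its right, so x %>% f(y) means f(x, y). A small sketch comparing the nested and piped forms of the same computation (the names nested and piped are made up for the example):

```r
library(magrittr)

x <- c(-5, -3, -1, 1, 3, NA)

# Nested call: read from the inside out
nested <- mean(abs(x), na.rm = TRUE)

# Piped call: read from left to right
piped <- x %>% abs() %>% mean(na.rm = TRUE)

# Both forms perform the same computation
identical(nested, piped)   # TRUE
```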

+ syntax with ggplot2

library(ggplot2)
ggplot(data = mtcars, mapping = aes(x = wt, y = hp)) +
    geom_point() +
    labs(title = "Horsepower vs. Weight", x = "Weight", 
         y = "Horsepower") +
    theme_classic()

Why + for ggplot2 only?

(Source: Twitter)

A deeper look at functions

Question: How do we find out what a function does: what inputs it accepts, what it outputs, and so on?

First answer: Google it! Google “R <function name>”

A (probably) better answer: Documentation in R itself!
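
A few ways to reach the built-in documentation from the R console, sketched with sample() as the example function:

```r
# Open the help page for sample() (two equivalent ways)
?sample
help("sample")

# Print just the argument list, including default values
args(sample)

# List the argument names programmatically
names(formals(sample))   # "x" "size" "replace" "prob"
```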

sample(): Description

sample(): Usage

In the Usage section, what comes after the = sign is the default value for that argument.
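
As a small illustration of default values: replace defaults to FALSE in sample(), so leaving it out gives the same behaviour as setting it to FALSE explicitly. The seed value 32 is arbitrary and is only there so that both calls draw the same sample.

```r
# Default value of the replace argument, as listed in the Usage section
formals(sample)$replace                        # FALSE

# Omitting replace is the same as writing replace = FALSE
set.seed(32)
a <- sample(1:10, size = 5)
set.seed(32)
b <- sample(1:10, size = 5, replace = FALSE)
identical(a, b)                                # TRUE
```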

sample(): Arguments

sample(): Details

sample(): Value

How does R know which arguments we are referring to?

Arguments given by name are matched first; the remaining arguments are then matched by position. (Outputs vary from run to run because sample() is random.)

sample(x = 1:10, size = 10)     # both arguments matched by name
##  [1]  6  2  5  9  4  1  8  7 10  3
sample(1:10, 10, TRUE)          # matched by position: x, size, replace
##  [1] 10  6  4  1  7  3  5  4  6  1
sample(1:10, TRUE, size = 5)    # size by name; then x = 1:10, replace = TRUE by position
## [1] 3 8 7 6 2
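
Because sample() is random, argument matching is easier to see with a deterministic function. The helper show_args() below is hypothetical; it has the same first three argument names as sample() and simply reports the values each one received.

```r
# Hypothetical helper: reports which value each argument was matched to
show_args <- function(x, size, replace = FALSE) {
    paste0("x=", x, " size=", size, " replace=", replace)
}

show_args(x = 1, size = 2)      # "x=1 size=2 replace=FALSE"
show_args(1, 2, TRUE)           # "x=1 size=2 replace=TRUE"
show_args(1, TRUE, size = 2)    # "x=1 size=2 replace=TRUE"
```

In the last call, size is matched by name first, and the two unnamed values then fill the remaining arguments x and replace in order.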

tidyr::gather()

E.g. dataset of no. of cases for each country

df
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

How to make a line plot of no. of cases by year for each country?

Probably want something like

ggplot(df) +
    geom_line(aes(x = year, y = cases, group = country))

Problem: Column names are values of the variable year.

Solution: Reshape dataset using tidyr’s gather()

(Source: R for Data Science)

df %>% gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <dbl>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

df %>% gather(`1999`, `2000`, key = "year", value = "cases") %>%
    ggplot() +
    geom_line(aes(x = as.numeric(year), y = cases, col = country))
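
A self-contained sketch of the gather() step. The slides do not show how df was created, so the tribble() call below is an assumed reconstruction of the printed table, and the name long is made up for the example.

```r
library(tibble)
library(tidyr)

# Assumed reconstruction of the wide dataset printed on the slides
df <- tribble(
    ~country,      ~`1999`, ~`2000`,
    "Afghanistan",     745,    2666,
    "Brazil",        37737,   80488,
    "China",        212258,  213766
)

# Gather the year columns into a key column (year) and a value column (cases)
long <- gather(df, `1999`, `2000`, key = "year", value = "cases")
long

# year came from column names, so it is stored as character;
# this is why the plot uses as.numeric(year)
class(long$year)
```

In tidyr 1.0.0 and later, gather() is documented as superseded by pivot_longer(), although gather() still works.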

tidyr::separate()

E.g. dataset of rate (cases / population) for each country

df
## # A tibble: 6 x 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

How to get cases and population into columns of their own?

Solution: Use tidyr’s separate()

(Source: R for Data Science)

df %>% separate(rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <dbl> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
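
Note that cases and population come out as character columns in the output above. A hedged sketch of one way to get numeric columns instead, using separate()'s convert argument; the tribble() call is an assumed reconstruction of the printed table, and sep_chr and sep_num are made-up names.

```r
library(tibble)
library(tidyr)

# Assumed reconstruction of the dataset printed on the slides
df <- tribble(
    ~country,      ~year, ~rate,
    "Afghanistan",  1999, "745/19987071",
    "Afghanistan",  2000, "2666/20595360",
    "Brazil",       1999, "37737/172006362",
    "Brazil",       2000, "80488/174504898",
    "China",        1999, "212258/1272915272",
    "China",        2000, "213766/1280428583"
)

# As on the slide: the new columns are character
sep_chr <- separate(df, rate, into = c("cases", "population"), sep = "/")

# With convert = TRUE, tidyr tries to convert the new columns to better types
sep_num <- separate(df, rate, into = c("cases", "population"), sep = "/",
                    convert = TRUE)
```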

Today’s dataset: Drought in California

Data source: United States Drought Monitor (USDM)

USDM: data download

USDM: data selection

The data in Excel

Optional material

USDM: data selection details

tidyr functions: gather and spread

gather: Used when some column names are not names of variables, but values of a variable

(Source: R for Data Science)

spread: Opposite of gather

(Source: R for Data Science)
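
A minimal sketch of spread() undoing a gather(). The long dataset is typed in by hand here to match the reshaped table from earlier in the session; the names long and wide are made up for the example.

```r
library(tibble)
library(tidyr)

# Long-format data: one row per country and year
long <- tribble(
    ~country,      ~year,  ~cases,
    "Afghanistan", "1999",    745,
    "Brazil",      "1999",  37737,
    "China",       "1999", 212258,
    "Afghanistan", "2000",   2666,
    "Brazil",      "2000",  80488,
    "China",       "2000", 213766
)

# Spread the values of year back out into column names
wide <- spread(long, key = year, value = cases)
wide
```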

tidyr functions: separate and unite

separate: Used to separate values in one column into multiple columns

(Source: R for Data Science)

unite: Opposite of separate

(Source: R for Data Science)
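
A minimal sketch of unite() undoing a separate(). The input is typed in by hand (first three rows only) with cases and population as character columns, as separate() produced them earlier; the names sep and united are made up for the example.

```r
library(tibble)
library(tidyr)

# Data with cases and population in separate (character) columns
sep <- tribble(
    ~country,      ~year, ~cases,  ~population,
    "Afghanistan",  1999, "745",   "19987071",
    "Afghanistan",  2000, "2666",  "20595360",
    "Brazil",       1999, "37737", "172006362"
)

# Paste cases and population back together into a single rate column
united <- unite(sep, rate, cases, population, sep = "/")
united
```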